B 



THE RELATIONAL APPROACH TO THE MANAGEMENT OF DATA BASES 

MAC Technical Memorandum 23 

Alois J. Strnad 
April 1971 



Work reported herein was supported in part by 
Project MAC, an M. I .T. research project sponsored 
by the Advanced Research Projects Agency, Department 
of Defense, under Office of Naval Research Contract 
N00014-69-A-0276-0002. 



PROJECT MAC 
Massachusetts Institute of Technology 

Cambridge Massachusetts 02139 



Z3ZM AT AG 16 TH3M33AMM 1HT OT- W3*©*<WA JA!KHTAJ3fl 3J# 



£S «ubf*neBi*M feairfriaaT 36ft 









bam} 2 .1 «iofA 



:••• :■ ;•"'«. 




¥t#ti Is I9«t tf #•!' MM 



etfSO ***s«uri:>e<?ae« 



s$b<i 1<M&3 



Pa<re 1 



I . Introduction. 

The ultimate goal of Project MacAIMS (MAC Advanced 
Interactive Management System) is to build a computer 
facility which w! 1 1 be able to support non-trivial decision 
making processes. (See reference k) . In the early stages 
of our experiments we discovered that traditional approaches 
to the management of data bases do not satisfy our needs. 
We have determined the following requirements for the 
management of Large Data Bases (LDB) in a dynamically 
varying environment such as an interactive Management 
Information System: 

high degree of flexibility 

- data independence (i.e. programs are unaffected by 

changes of data representation or by addition 
of new data types) 
ability to operate on different data structures 

- access path Independent of data structure 
acces control below the file level 

uniform retrieval time and (possibly) update time 

The relational approach to management of LDB offers a 
solution which satisfies most of the above requirements. 
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This paper is concerned primarily with our implementation of 
this approach. The theoretical basis for the relational 
approach to management of LDB can be found in recently 
published papers (1,2). 

We are gradually implementing our system on MULTICS 
(Multiplexed Information and Computing Service). MULTICS is 
a major achievment in the area of mul t i -programming, 
multi-processing and time-sharing computer systems. It is a 
prototype of the computer utility, where "utility" is used 

in the same sense as the telephone or power utility. It can 
run seven days a week, twenty four hours a day, and it is 
capable of meeting a wide range of service demands. 

Two aspects of MULTICS are very important for our 
implementation -- memory organization and access control. 
In MULTICS, the traditional hierarchy of storage -,- ranging 
from core through disks, tapes etc.-- appears to the user as 
a single level, two-dimensional memory organization. Each 
user process has available up to 256,000 identical memories 
called segments, and each segment may contain up to 6^,000 
36-bit words. The entire range of this space (2 ) is 
directly addressable. 

The other important aspect of MULTICS for our purposes 
is the facility for access control. Again, a 
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two-dimensional approach has been taken. Fach segment has 
associated with it an access control list which identifies 
all the users authorized to use that segment and which 
specifies the mode of access for each user. The second 
dimension of access control is provided by concentric 
protection rings. Any attempt to access a segment from an 
insufficiently privileged ring (i.e. an outer ring) will 
cause a trap, and a routine supplied by the owner of the 
segment will be called to decide whether or not to permit 
the access. 

Let me also mention that our system is entirely 
programmed in PL/1, the primary language on MULTICS and the 
language in which MULTICS system itself is written. 



I I • General Information . 

We take the view that information we might store in our 
data base consists of sets of Data Elements (DE) and sets of 

relations among them. The basic set theoretic primitive 

operations are used for manipulating the Relation Data Sets 

(RDS) . 

Given Data Element Setss (DES) SI, S2 / ...Sn, 

corresponding Relation Data Sets consist of n-tuples (tuples 
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of degree n), each of which has its first element from set 
Si, its second element from S2, and so on. The Relation 
Descriptor (RD) is, in our terminology, the n-tuple composed 

of the names of the sets SI, S2, Sn . Suppose, for 

example, that there are Data Element Sets for persons' 
names, for addresses and for telephone numbers. We might 
construct an RDS which will represent the relations among 
members of these sets. The Relation Descriptor for this RDS 
will be the 3-tuple <person-name , address, tel ephone-number> . 
All other tuples will express the relation among the members 
of these sets. In our implementation, the rplations are 
stored exclusively in terms of Reference Numbers. 

The whole system is logically divided into two major 
parts. In the first, Data Element Sets are stored, 
Reference Numbers are assigned to the Data Elements, and 
operations are performed on DES's. In the second part of 
the system, Relation Data Sets are created and stored and 
basic set theoretic primitive operations are performed on 
them. 

Reference Numbers (identification numbers) play an 
important role in our implementation. Whenever a new DE 
enters the system, it is immediately assigned a Reference 
Number. The RM is used for all subsequent operations on 
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that DE. The method used for storing and assigning R M ' s to 
DE's guarantees that the particular HE is stored only one 
time within the system. This fact is a significant 
contribution to our goal of storage and operational 
efficiency. In our particular implementation, RM's are all 
36-bit quantities, which is the word length of the GE 6U5 
computer on which f^ULTICS runs. Because the most frequent 
operations are performed on the Relation Data Sets and 
because all relations are expressed in terms of RM's, the 
operational efficiency of using the fixed length numbers is 
obvious. We also save storage space since the average length 
of a Data Element is almost always greater than one machine 

word . 

Both parts of the system are described in detail in the 

following sections of this paper. 

Ml. Data Elem ent Sets. 

tn our system each category of data, such as name, 
address, telephone number, pointer, tuple, names of sets, 
etc. might be stored in one Data Element Set (DFS). Mote 
that tuples, pointers, etc. are treated in the same fashion 
as other categories. For example, persons' names can be 



stored in one DES, and the narre of this set is simply 
"person's narre". 

Moreover, the name of a set is just another data 
category. Therefore there is also one DES for narres of Data 
Element Sets, with its own name being "set name". The names 
of all DES's are stored in this DES. One reason for this 
provision is that the system is self-descriptive -- i.e. all 
information about the data is stored within the system 
itself. Futhermore this descriptive data is treated exactly 
the same way as any other data. This is one significant 
advantage to our approach. 

The total number of Data Element Sets depends on the 
particular application. For example, a Library Information 
System probably will have a DES for titles. 

The reader might be somewhat confused at this point, 
but figure 1 and the rest of the paper will give a clearer 
explanation of the above concepts. 

A Data Element Module (DEM) is a procedure which 

operates on a Data Element Set. Each DES may have its "own" 

DEM. We have defined the following functions of a DEM: 

read - read the input stream and produce the Standard Form 

(SF). For example, the input to the DEM for dates 

might be "8, Febr., 1969". The SF returned by this 
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entry is "19690208". Data Elements can be stored in 

the DES in SF only. 
write - write the given SF in natural, human-oriented 

form. Taking the Standard Form from the previous 

example, the output would be "February 8, 1969". 
get reference number - the Reference Number corresponding 

to the given SF will be returned. 
p;gt data element - the SF corresponding to the given RN 

will be returned . 
insert - the given SF will be inserted into the specified 

DES and RN assigned, 
del ete - the specified SF is deleted from the DFS. 

Corresponding functions of all DEM's exept "read" and 
"write" perform the same processing. Therefore a common 
procedure called a Data Strategy Module (DSM) may be invoked 
as an intermidiary between the DEM and the DES. However the 
mechanism for assigning the RN to the DE (actually to the 
SF) is different for each data type. Therefore there are 
different DSM's for each data type. The most freguently 
occuring data types are integers, character strings, and 
real numbers. For example, data categories such as salary 
or date have SF's which fall into the data type integer. 

The Standard Forms of Data Elements are physically 
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stored in the DES in the form of binary trres. The main 
distinction between them is that for integers the SF and RM 
a re i dent ica 1 . 

Figures 2 and 3 show the mechanism for storing the SF's 
and assigning the RN ' s to them for both types. 

As the Data Element Sets grow, the binary trees may 
become unbalanced, resulting in less efficient operation of 
the system. Programs which run in background mode handle 
this situation by rebalancing the trees. 

We have also implemented two "special" Reference 
Numbers which are very useful. One is used to indicate a 
null, or absent DE. The other one acts as a "wild card"-- 
that is, upon comparison with other Reference Numbers, it 
will match any RN regardless of value. In the following 
examples the null RN will be represented by "0" and the 
"wild card" by "*". 

The left part of figure 1 illustrates the interaction 
between procedures and Data Element Sets. Notice that the 
DEM's provide the direct interface between the outside world 
and the part of the system described above. 
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1 v - Relation Data Sets . 

In section II the notion of Relation Data Set ( RPS ) , 
Relation Descriptor (RD), relation, and tuple were 
introduced. In section III, the mechanism for keeping track 
of Data Elements (DE) and assigning Reference Numbers (RM) 
to them was presented. 

The exact structure of the RDS's is irrelevant to the 
theory underlying the relational approach to data 
management. However, we believe that data (tuples in the 
case of RDS) should be stored using the representation most 
efficient for the particular application. Therefore the 
procedure, called a Relation Strategy Module (RSM), which 
operates on a particular RDS is designed for a particular 
data structure. Thus there is an RSM for lists, an RSM for 
trees, an RSM for arrays, etc. Note the difference between 
DEM's and RSM's: each DEM operates on Data Element Sets and 
is designed for a particular category of DE's; each RSM 
operates on Relational Data Sets, and is designed for a 
particular data structure. 

The basic set primitive operations and several non-set 
theoretic operations are defined. These operations sre 
performed by an RSM on a particular ROS . Before we describe 
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these operations in detail, we present examples of two 
RDS's. For the purposes of the following explanations, it 
is sufficient to represent RDS's as arrays, and the array 
elements as DE ' s . Keep in mind, however, that other 
representations are possible and that, in any case, the DF's 
are actually represented by RN's. 



RDS (A) 



RDS (P) 



name 



c i ty prof . sal . 



Wels Cambr ens. 18000 

Owens Boston prog 16000 

Niles Waban prog 15000 

Jones Lynn dir. 2^000 

Strand Lexin. clerk 12000 



name phone age 

Jones 6205 30 

Smith 5U32 20 

Strand 5857 3k 

Wyly 6669 hi 



The above RDS's have the following properties: 

1. Each row represents an n-tuple. 

2. All rows are distinct. 

3. The ordering of rows and columns is immaterial 



In our implementation, all of the set primitive 
operations require two RDS's as input and produce a third 
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RDS which is the result of the operation. The other 
operations do not necessarily follow this same scheme. 

In order to have the capability of creating RDS's and 
fill ins them with data we have had to define two "non-set 
theoretic" operations: 



create set 



Input: RD 
Example: RD 
Resul t : 



Output: RDS 
<name / city / phone> 

RDS(C) 
name citv phone 



The result is an RDS with one tuple -- the RD 



insert tuple 

Input: RDS,tuple Output: RDS 

Example: RDS(C), tuple = <Gol ub, Boston ,8885U2> 
Result: RDS(C) 

name citv phone 

Golub Boston 888542 



The following operations are basic set primitives 




zx -*Sgp 



tppIw^h 
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Output : RDS 



'Hjhtcfatcilqp 

Input: RDS, RDS 
Example: RDS(A) ,RDS(E) 

Suppose we created RDS(E) 

name city profi sal. 

* * prog * 
Result: RDS(F) 

name , citv , prof . sal . 

Owens Boston prog 16000 
Niles Waban prog 15000 



Note the use of the "wild card". 



4! f ft rinse 



Input: RDS, RDS Output: RDS 

Example: RDS(A) ,RDS(F) 

Result: RDS(G) 

name cftv prof. sal, 

Wels Camb eng, 18000 

Jones Lynn dlr. 2*t000 

Strand Lexfn clerk 12000 
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projection 



Jnput: RDS,RDS 
Example: RDS(A) ,RDSCH) 

Suppo** we have 

name 
Resul t: 

name 



Qiftput: 8DS 

RDS(H) 
§?1 ■ 



RDS(I) 



Wels 

Owens 

Nlles 

Jones 

Strand 



18000 
16000 
15000 
2U000 
12000 



JJilH. 



Input: RDS,RDS Output: RDS 

Example: RDS(A) ,RDS(B) 

Result: RDS(J) 

name cttv prof. sal. ..Phone 



Jones Lynn dl r. 
Strand Lex in clerk 



ase 



2U000 6205 30 
12000 5857 31* 
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Output: RDS 



compos i t ion 

Input: RDS,RDS 
Example: RDS(A) , RDS( K) 

Suppose, we have the RDS(K) 

sal . sen ior i tv 

8500 2 
32000 k5 
16000 10 
12000 Ik 
24000 16 
Result: RDS(M) 

name city prof. seniority 
Owens Boston proe; 10 
Jones Lynn d i r . 16 
Strand Lexin clerk Ik 



Output :R0S 



cartesian product 

Input :RDS / RDS 
Example: RDS(B) ,RDS(N) 

Suppose we have the RDS(N) 

af f ? 1 iat . bul ding 
MIT 
MAC E-58 
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Result: 



RDS(O) 

nape phone ajtfi afflUt* fel&j 

Jones 6205 30 
Jones 6205 30 
Smith 5 43 2 20 
Smith 51*32 20 
Strand 5857 31* 
Strand 5857 *U 
Wyly 6669 h2 
Wyly 6669 k2 



MIT 





MAC 


E-58 


MIT 





MAC 


E-58 


MIT 





MAC 


E-58 


MIT 





MAC 


€-58 



union 



Input: RDS,RDS 
Example: RDS(B) ,RDS(P) 



Output: RDS 



Suppose, we created ROS(P) 
name phon« »ee 

Sames 5555 5 2 

-Strand 5857 Jk 
Buck 8765 6k 



$ 
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Result: RDS(T) 

name phone a^e 

Jones 6205 30 

Smith 5432 20 

Strand 5857 34 

Wyly 6669 42 
Sames 5555 52 
Buck 8765 64 

In order to improve efficiency and to make the usage of 
the system simpler we have defined several redundant 
operations i.e. they can be performed by several calls of 
primitive operations, 

sort 

get successor 

replace tuple 

As has been previously mentioned, a Relation Strategy 
Module is designed to operate on a particular data 
structure. However, one of our requirements for management 
of data bases is to have the capability to combine and 
operate on data which have different representations. To 
accomplish this, each RSM has the capability to process not 
only the particular structure for which it is designed, but 
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also to operate on the cannonical form of an R n S. A PSM is 
able to reproduce its "own" type of RPS in the canonical 
form, which in our current implementation is a list. 

V. Final Remarks. 

We believe that our implementation of a relational 
system will give a solution for the management of Large Data 
Bases which will be able to support our overal research 
objectives. For experimental purposes, we have developed a 

simple user interface, and we are just now starting to move 
data into the system. We expect that during a relatively 
short period of time we will know enough about the behavior 
and efficiency of the system that we will be able to make 
substantial improvements. We should be able to report the 
experimental results of this preliminary system in late 
February 1971. 

VI. List of Abbreviatios. 

DE - Data Element 
DES - Data Element Set 
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DEM - Data Element Module 
DSM - Data Strategy Module 

RDS - Relation Data Set 
RD - Relation Descriptor 
RDS - Relation Data Set 
RSM - Relation Strategy Module 

MULTICS - Multiplexed Information and Computing 
Serv i ce . 
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Binary Tree for Character Strings 



000000000000000000000000000000000001 

Mauldin 




Murphy 
110001 



In succeeding reference numbers, all bits to the left of the flag 
have been omitted for clarity. First left-most "one" bit is used 
as a flag to indicate the start of significance. 

Figure 2 
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Binary Tree for Integers 




Figure 3 
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